Optimizing Provenance Computations
Authors
Abstract
Data provenance is essential for debugging query results, auditing data in cloud environments, and explaining outputs of Big Data analytics. A well-established technique is to represent provenance as annotations on data and to instrument queries to propagate these annotations to produce results annotated with provenance. However, even sophisticated optimizers are often incapable of producing efficient execution plans for instrumented queries, because of their inherent complexity and unusual structure. Thus, while instrumentation enables provenance support for databases without requiring any modification to the DBMS, the performance of this approach is far from optimal. In this work, we develop provenance-specific optimizations to address this problem. Specifically, we introduce algebraic equivalences targeted at instrumented queries and discuss alternative, equivalent ways of instrumenting a query for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization (CBO) framework that governs the application of these optimizations and implement this framework in our GProM provenance system. Our CBO is agnostic to the plan space shape, uses a DBMS for cost estimation, and enables retrofitting of optimization choices into existing code by adding a few LOC. Our experiments confirm that these optimizations are highly effective, often improving performance by several orders of magnitude for diverse provenance tasks.
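The idea of instrumenting a query so that it propagates provenance annotations can be illustrated with a small sketch. The example below is not taken from the paper and does not reproduce GProM's rewrite rules; the `orders` table, the `prov_orders_*` column names, and the `run_example` function are hypothetical, and the join-back rewrite for aggregation is only one common way such instrumentation is done. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3


def run_example():
    """Run a query and a provenance-instrumented version of it (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical input relation.
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "alice", 10), (2, "alice", 5), (3, "bob", 7)],
    )

    # Original query: total amount per customer.
    original = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY customer
    """

    # Instrumented query (assumed rewrite): join the aggregate result back to
    # the input on the group-by attribute, so every result row carries the
    # input rows that contributed to it as extra annotation columns.
    instrumented = """
        SELECT q.customer, q.total,
               o.id       AS prov_orders_id,
               o.customer AS prov_orders_customer,
               o.amount   AS prov_orders_amount
        FROM (SELECT customer, SUM(amount) AS total
              FROM orders
              GROUP BY customer) AS q
        JOIN orders AS o ON o.customer = q.customer
        ORDER BY q.customer, o.id
    """

    plain = conn.execute(original).fetchall()
    annotated = conn.execute(instrumented).fetchall()
    conn.close()
    return plain, annotated


if __name__ == "__main__":
    plain_rows, annotated_rows = run_example()
    for row in plain_rows:
        print("result:", row)
    for row in annotated_rows:
        print("result + provenance:", row)
```

Even in this toy case the instrumented query adds a subquery and a join to the original, which gives a rough sense of why the plans for instrumented queries can be harder for an optimizer to handle.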
Similar resources
Reorganizing Workflow Evolution Provenance
The provenance of related computations presents the opportunity to better understand and explore the differences and similarities of various approaches. As users design and refine workflows, evolution provenance captures the relationships between workflows as actions that mutate one workflow to another. However, such provenance may not always be the most compact or intuitive. This paper present...
Provenance Query Patterns for Many-Task Scientific Computing
Provenance information enables the analysis of large-scale many-task computations, often specified as scientific workflows. It allows one to determine how each resulting data set was derived from other data sets and applications. In this work, we survey queries used for exploring provenance information about many-task computations. We present a set of patterns that can be identified in these...
ES3: A Demonstration of Transparent Provenance for Scientific Computation
The Earth System Science Server (ES3) is a software environment for data-intensive Earth science, with unique capabilities for automatically and transparently capturing and managing the provenance of arbitrary computations. Transparent acquisition avoids the scientist having to express their computations in specific languages or schemas for provenance to be available. ES3 models provenance as r...
Provenance-based reproducibility in the Semantic Web
Reproducibility is a crucial property of data since it allows users to understand and verify how data was derived, and therefore allows them to put their trust in such data. Reproducibility is essential for science, because the reproducibility of experimental results is a tenet of the scientific method, but reproducibility is also beneficial in many other fields, including automated decision ma...
Provenance-Only Integration
As provenance records are collected from an increasingly diverse set of sources, the need to integrate them grows. The alternative approach of reconciling semantics scales when the records are queried infrequently. However, as the use of provenance grows, normalizing the diverse provenance via formal integration will yield better query performance. We describe two motivating cases for integrati...
Journal title:
- CoRR
Volume abs/1701.05513, Issue
Pages -
Publication date 2016